Introduction

Collecting the Data

Import pandas and numpy

import pandas as pd
import numpy as np

Let’s take a look at the files in our input directory, using the convenient shell commands in IPython.

!ls data
311-service-requests.csv
All-Web-Site-Data-Audience-Overview.xlsx
Aussie_Wines_Plotting.csv
March-2017-forecast-article.xlsx
Traffic_20170306-20170519.xlsx
Tweets.csv
airbnb_session_data.txt
customer-status.xlsx
debtors.csv
excel_sample.xlsx
general-ledger-sample.xlsx
mn-budget-detail-2014.csv
pnl.xlsx
population.xlsx
sales-estimate.xlsx
sales-feb-2014.xlsx
sales-jan-2014.xlsx
sales-mar-2014.xlsx
sales_data_types.csv
sales_transactions.xlsx
salesfunnel.xlsx
sample-sales-reps.xlsx
sample-sales-tax.csv
sample-salesv3.xlsx
tb_import.xlsx
~$general-ledger-sample.xlsx

There are a lot of files, but we only want to look at the sales .xlsx files.

!ls data/sales-*-2014.xlsx
data/sales-feb-2014.xlsx data/sales-jan-2014.xlsx data/sales-mar-2014.xlsx

Use the Python glob module to easily list out the files we need.

import glob
glob.glob("data/sales-*-2014.xlsx")
['data/sales-feb-2014.xlsx',
 'data/sales-jan-2014.xlsx',
 'data/sales-mar-2014.xlsx']

This gives us what we need; let’s import each of our files and combine them into one DataFrame.

pandas’ concat and append can both do this for us. I’m going to use append in this example.

The code snippet below initializes an empty DataFrame, then appends each of the individual files to the all_data DataFrame.

all_data = pd.DataFrame()
for f in glob.glob("data/sales-*-2014.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)
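
A note on versions: DataFrame.append was later deprecated and removed in pandas 2.0, so on newer releases a minimal equivalent built on pd.concat looks like this:

# build the list of frames first, then concatenate once
frames = [pd.read_excel(f) for f in glob.glob("data/sales-*-2014.xlsx")]
all_data = pd.concat(frames, ignore_index=True)

Concatenating once at the end is also faster than appending in a loop, since append copies the full DataFrame on every iteration.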

Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure your data looks good.

all_data.shape
(384, 7)
all_data.describe()
       account number    quantity  unit price    ext price
count      384.000000  384.000000  384.000000   384.000000
mean    478125.989583   24.372396   56.651406  1394.517344
std     220902.947401   14.373219   27.075883  1117.809743
min     141962.000000   -1.000000   10.210000   -97.160000
25%     257198.000000   12.000000   32.612500   482.745000
50%     424914.000000   23.500000   58.160000  1098.710000
75%     714466.000000   37.000000   80.965000  2132.260000
max     786968.000000   49.000000   99.730000  4590.810000

A lot of these statistics may not make much sense for this data set, but I’m most interested in the count row, to make sure the number of data elements is what we expect.

all_data.head()
account number name sku quantity unit price ext price date
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20

It is not critical in this example, but best practice is to convert the date column to a datetime object.

all_data['date'] = pd.to_datetime(all_data['date'])
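
Depending on your pandas version, you can often ask read_excel to do this at load time instead; for example:

# parse the date column while reading, rather than converting afterwards
jan = pd.read_excel("data/sales-jan-2014.xlsx", parse_dates=["date"])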

Combining Data

Now that we have all of the data in one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company’s customer segmentation strategy or some other mechanism for identifying their customers.

First, we read in the data.

status = pd.read_excel("data/customer-status.xlsx")
status
account number name status
0 740150 Barton LLC gold
1 714466 Trantow-Barrows silver
2 218895 Kulas Inc bronze
3 307599 Kassulke, Ondricka and Metz bronze
4 412290 Jerde-Hilpert bronze
5 729833 Koepp Ltd silver
6 146832 Kiehn-Spinka silver
7 688981 Keeling LLC silver
8 786968 Frami, Hills and Schmidt silver
9 239344 Stokes LLC gold
10 672390 Kuhn-Gusikowski silver
11 141962 Herman LLC gold
12 424914 White-Trantow silver
13 527099 Sanford and Sons bronze
14 642753 Pollich LLC bronze
15 257198 Cronin, Oberbrunner and Spencer gold

We want to merge this data with our concatenated data set of sales. We use pandas’ merge function and tell it to do a left join, which is similar to Excel’s VLOOKUP function.

all_data_st = pd.merge(all_data, status, how='left')
all_data_st.head()
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 NaN
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver
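
By default, merge joins on every column name the two DataFrames share, which here is account number and name. If you want to be explicit (a good habit when column names might overlap accidentally), you can spell the keys out:

# equivalent to the call above, with the join keys made explicit
all_data_st = pd.merge(all_data, status, how='left', on=["account number", "name"])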

This looks pretty good but let’s look at a specific account.

all_data_st[all_data_st["account number"]==737550].head()
account number name sku quantity unit price ext price date status
15 737550 Fritsch, Russel and Anderson S1-47412 40 51.01 2040.40 2014-02-05 01:20:40 NaN
25 737550 Fritsch, Russel and Anderson S1-06532 34 18.69 635.46 2014-02-07 09:22:02 NaN
66 737550 Fritsch, Russel and Anderson S1-27722 15 70.23 1053.45 2014-02-16 18:24:42 NaN
78 737550 Fritsch, Russel and Anderson S2-34077 26 93.35 2427.10 2014-02-20 18:45:43 NaN
80 737550 Fritsch, Russel and Anderson S1-93683 31 10.52 326.12 2014-02-21 13:55:45 NaN

This account number was not in our status file, so we have a bunch of NaNs. We can decide how we want to handle this situation. For this specific case, let’s label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column.

all_data_st['status'].fillna('bronze',inplace=True)
all_data_st.head()
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver
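
As an aside, fillna with inplace=True on a single column can trigger chained-assignment warnings on newer pandas releases, and under copy-on-write it may not modify the DataFrame at all. Plain assignment does the same thing on every version:

# assignment form of the fillna above
all_data_st['status'] = all_data_st['status'].fillna('bronze')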

Check the data just to make sure we’re all good.

all_data_st[all_data_st["account number"]==737550].head()
account number name sku quantity unit price ext price date status
15 737550 Fritsch, Russel and Anderson S1-47412 40 51.01 2040.40 2014-02-05 01:20:40 bronze
25 737550 Fritsch, Russel and Anderson S1-06532 34 18.69 635.46 2014-02-07 09:22:02 bronze
66 737550 Fritsch, Russel and Anderson S1-27722 15 70.23 1053.45 2014-02-16 18:24:42 bronze
78 737550 Fritsch, Russel and Anderson S2-34077 26 93.35 2427.10 2014-02-20 18:45:43 bronze
80 737550 Fritsch, Russel and Anderson S1-93683 31 10.52 326.12 2014-02-21 13:55:45 bronze

Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability.

Using Categories

One of the relatively new capabilities in pandas is support for categorical data. From the pandas documentation:

“Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.”

For our purposes, the status field is a good candidate for a category type.

You must make sure you have a recent version of pandas installed for this example to work.

pd.__version__
'0.22.0'

First, we typecast it to a category using astype.

all_data_st["status"] = all_data_st["status"].astype("category")

This doesn’t appear to change anything yet.

all_data_st.head()
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver

But you can see that it is a new data type.

all_data_st.dtypes
account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
status                  category
dtype: object

Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically.

all_data_st.sort_values(by=["status"]).head()
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
196 218895 Kulas Inc S2-83881 41 78.27 3209.07 2014-01-20 09:37:58 bronze
197 383080 Will LLC B1-33364 26 90.19 2344.94 2014-01-20 09:39:59 bronze
198 604255 Halvorson, Crona and Champlin S2-11481 37 96.71 3578.27 2014-01-20 13:07:28 bronze
200 527099 Sanford and Sons B1-05914 18 64.32 1157.76 2014-01-20 21:40:58 bronze

We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering.

all_data_st["status"].cat.set_categories(["gold", "silver", "bronze"], inplace=True)
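
If you are on a newer pandas release, set_categories no longer accepts inplace=True; an explicit CategoricalDtype is the supported way to get the same ordering:

from pandas.api.types import CategoricalDtype

# an explicit, ordered dtype with the same Olympic medal ordering
medal_order = CategoricalDtype(["gold", "silver", "bronze"], ordered=True)
all_data_st["status"] = all_data_st["status"].astype(medal_order)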

Now, we can sort it so that gold shows on top.

all_data_st.sort_values(by=["status"]).head()
account number name sku quantity unit price ext price date status
68 740150 Barton LLC B1-38851 17 81.22 1380.74 2014-02-17 17:12:16 gold
63 257198 Cronin, Oberbrunner and Spencer S1-27722 28 10.21 285.88 2014-02-15 17:27:44 gold
207 740150 Barton LLC B1-86481 20 30.41 608.20 2014-01-22 16:33:51 gold
61 740150 Barton LLC B1-20000 28 81.39 2278.92 2014-02-15 07:45:16 gold
60 239344 Stokes LLC S2-83881 30 43.00 1290.00 2014-02-15 02:13:23 gold
all_data_st["status"].describe()
count        384
unique         3
top       bronze
freq         172
Name: status, dtype: object

For instance, you may want to take a quick look at how your top-tier customers are performing compared to the bottom tier. Use groupby to get the average of the values.

all_data_st.groupby(["status"])[["quantity", "unit price", "ext price"]].mean()
quantity unit price ext price
status
gold 24.375000 53.723889 1351.944583
silver 22.842857 57.272714 1320.032214
bronze 25.616279 57.371163 1472.965930

Of course, you can run multiple aggregation functions on the data to get really useful information.

all_data_st.groupby(["status"])[["quantity", "unit price", "ext price"]].agg([np.sum, np.mean, np.std])
quantity unit price ext price
sum mean std sum mean std sum mean std
status
gold 1755 24.375000 14.575145 3868.12 53.723889 28.740080 97340.01 1351.944583 1182.657312
silver 3198 22.842857 14.512843 8018.18 57.272714 26.556242 184804.51 1320.032214 1086.384051
bronze 4406 25.616279 14.136071 9867.84 57.371163 26.857370 253350.14 1472.965930 1116.683843
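
On pandas 0.25 or later, named aggregation makes the output columns easier to read; a sketch of the same summary:

# name the output columns directly (pandas >= 0.25)
all_data_st.groupby("status").agg(
    total_units=("quantity", "sum"),
    avg_unit_price=("unit price", "mean"),
    total_ext_price=("ext price", "sum"),
)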

So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.

Maybe we should look at how many bronze customers we have and see what is going on.

What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.

I’m purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean.

all_data_st.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1,7]].groupby(["status"])["name"].count()
status
gold      4
silver    7
bronze    9
Name: name, dtype: int64
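
A slightly shorter way to get the same counts, since status is one value per account, is value_counts on the de-duplicated frame:

# one row per account, then count each tier
all_data_st.drop_duplicates(subset=["account number", "name"])["status"].value_counts()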

Ok. This makes a little more sense. We see that we have 9 bronze customers and only 4 gold customers. That is probably why the volumes are so skewed towards our bronze customers.

all_data_st.head(4)
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze

We will start over and import a clean file.

df = pd.read_excel("data/sample-sales-reps.xlsx")

## Set default commission of 3%
df["commission"] = .03
df.head()
account number customer name sales rep sku category quantity unit price ext price date commission
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.03
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.03
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.03
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.03
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.03

Since shirts are high margin, adjust all products in the shirt category to a commission rate of 5%.

df.loc[df["category"] == "Shirt", ["commission"]] = .05
df.head()
account number customer name sales rep sku category quantity unit price ext price date commission
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.03
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05

There is also a special program for selling 10 or more belts in a transaction: those lines earn a 4% commission.

df.loc[(df["category"] == "Belt") & (df["quantity"] >= 10), ["commission"]] = .04
df.head()
account number customer name sales rep sku category quantity unit price ext price date commission
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.04
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05

Finally, some transactions can get a bonus and a commission increase.

df["bonus"] = 0
df.loc[(df["category"] == "Shoes") & (df["ext price"] >= 1000 ), ["bonus", "commission"]] = 250, 0.045
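
If you prefer to express all of the rules in one place rather than as a sequence of .loc overwrites, numpy's select can consolidate them. A sketch using the same thresholds as above:

# each condition is checked in order; the first match wins
conditions = [
    df["category"] == "Shirt",                                # high-margin shirts
    (df["category"] == "Belt") & (df["quantity"] >= 10),      # belt volume program
    (df["category"] == "Shoes") & (df["ext price"] >= 1000),  # bonus-eligible shoes
]
df["commission"] = np.select(conditions, [0.05, 0.04, 0.045], default=0.03)
df["bonus"] = np.where(conditions[2], 250, 0)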

Calculate the compensation at the line item level

df["comp"] = df["commission"] * df["ext price"] + df["bonus"]
df.head()
account number customer name sales rep sku category quantity unit price ext price date commission bonus comp
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.04 0 67.2524
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05 0 11.7105
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05 0 28.6995
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05 0 54.1740
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05 0 18.1600

Calculate the commissions by sales rep

df.groupby(["sales rep"])["comp"].sum().round(2)


sales rep
Ansley Cummings       2699.69
Beth Skiles           3664.16
Esequiel Schinner    12841.28
Loring Predovic      13115.42
Shannen Hudson        6541.78
Teagan O'Keefe       10931.30
Trish Deckow          7641.91
Name: comp, dtype: float64
df["date"] = pd.to_datetime(df['date'])

df["month"] = df["date"].dt.month
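
One caveat before grouping: this data spans both 2015 and 2016, so the bare month number folds the two years together. If you need them kept apart, a monthly Period preserves the year:

# a year-aware alternative to dt.month
df.groupby([df["date"].dt.to_period("M"), "sales rep"])["comp"].sum().round(2).head()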

Now, what about on a monthly basis?

df.head()
account number customer name sales rep sku category quantity unit price ext price date commission bonus comp month
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.04 0 67.2524 11
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05 0 11.7105 2
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05 0 28.6995 8
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05 0 54.1740 1
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05 0 18.1600 8
df.groupby(["month","sales rep"])["comp"].sum().round(2)
month  sales rep        
1      Ansley Cummings       157.10
       Beth Skiles            78.09
       Esequiel Schinner    1481.78
       Loring Predovic       655.96
       Shannen Hudson        319.13
       Teagan O'Keefe        732.10
       Trish Deckow          305.58
2      Ansley Cummings       347.06
       Beth Skiles          1262.36
       Esequiel Schinner     741.87
       Loring Predovic      1794.22
       Shannen Hudson        524.58
       Teagan O'Keefe        893.54
       Trish Deckow          430.81
3      Ansley Cummings       362.13
       Beth Skiles           439.53
       Esequiel Schinner    1323.87
       Loring Predovic      1183.59
       Shannen Hudson        474.90
       Teagan O'Keefe       1064.76
       Trish Deckow          796.12
4      Ansley Cummings       123.27
       Beth Skiles           121.12
       Esequiel Schinner    1478.78
       Loring Predovic       907.41
       Shannen Hudson        514.77
       Teagan O'Keefe        593.64
       Trish Deckow          347.04
5      Ansley Cummings       101.29
       Beth Skiles           347.73
                             ...   
8      Teagan O'Keefe        477.33
       Trish Deckow         1493.39
9      Ansley Cummings       109.70
       Beth Skiles           145.43
       Esequiel Schinner     178.20
       Loring Predovic       978.46
       Shannen Hudson        765.28
       Teagan O'Keefe       1086.80
       Trish Deckow          435.42
10     Ansley Cummings       510.49
       Beth Skiles           204.40
       Esequiel Schinner     994.76
       Loring Predovic      1954.61
       Shannen Hudson        990.48
       Teagan O'Keefe       1111.47
       Trish Deckow          625.12
11     Ansley Cummings        80.41
       Beth Skiles           464.13
       Esequiel Schinner     928.05
       Loring Predovic      1253.52
       Shannen Hudson        567.31
       Teagan O'Keefe        554.03
       Trish Deckow          281.69
12     Ansley Cummings       288.82
       Beth Skiles           131.40
       Esequiel Schinner    1235.69
       Loring Predovic       660.44
       Shannen Hudson        421.24
       Teagan O'Keefe        421.16
       Trish Deckow          848.57
Name: comp, Length: 84, dtype: float64
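
That long Series is easier to scan if you pivot the rep level into columns with unstack:

# months as rows, reps as columns
df.groupby(["month", "sales rep"])["comp"].sum().round(2).unstack("sales rep")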

You can also do this for the entire numeric DataFrame, without creating a month variable, by resampling on the date index as follows.

df.set_index('date').groupby('sales rep').resample("M").sum().head(20)
account number quantity unit price ext price commission bonus comp month
sales rep date
Ansley Cummings 2015-06-30 3593984 59 196.43 2214.13 0.165 250 342.04250 24
2015-07-31 3593984 43 152.77 1460.69 0.160 0 58.43250 28
2015-08-31 7187968 77 437.26 4080.37 0.340 0 171.32280 64
2015-09-30 6289472 64 398.20 2691.38 0.280 0 109.69920 63
2015-10-31 6289472 86 437.41 5803.84 0.315 250 510.48870 70
2015-11-30 3593984 25 259.40 1715.97 0.160 0 80.40890 44
2015-12-31 9883456 139 465.39 6820.11 0.450 0 288.81630 132
2016-01-31 7187968 85 413.52 4346.59 0.310 0 157.10410 8
2016-02-29 4492480 56 233.69 2561.57 0.185 250 347.06005 10
2016-03-31 3593984 43 260.11 2553.24 0.175 250 362.13085 12
2016-04-30 4492480 55 276.44 2855.68 0.200 0 123.26960 20
2016-05-31 3593984 36 282.24 2151.62 0.190 0 101.28800 20
2016-06-30 3593984 15 257.26 980.25 0.180 0 47.62630 24
Beth Skiles 2015-06-30 846366 37 161.44 1763.43 0.150 0 88.17150 18
2015-07-31 1692732 64 235.55 2468.03 0.190 0 81.93400 42
2015-08-31 3385464 123 699.43 6275.63 0.470 0 283.95210 96
2015-09-30 1692732 68 326.76 3374.01 0.240 0 145.42650 54
2015-10-31 3103342 106 591.17 5248.15 0.410 0 204.40310 110
2015-11-30 1974854 78 412.76 4496.55 0.315 250 464.13370 77
2015-12-31 2539098 55 480.51 2970.65 0.360 0 131.39730 108
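
Note that sum here is applied to every numeric column, so totals like the summed account number are meaningless. Selecting the columns of interest first keeps the output sensible:

# restrict the resample to the columns where a sum means something
df.set_index('date').groupby('sales rep')[["ext price", "comp"]].resample("M").sum().head()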

What if you want weekly totals, with the weeks ending on Monday?

df.set_index('date').groupby('sales rep').resample("W-Mon").sum().head(20)
account number quantity unit price ext price commission bonus comp month
sales rep date
Ansley Cummings 2015-06-15 1796992 39 72.79 1436.45 0.075 250 309.1255 12
2015-06-22 898496 18 33.15 596.70 0.040 0 23.8680 6
2015-06-29 898496 2 90.49 180.98 0.050 0 9.0490 6
2015-07-06 898496 11 22.99 252.89 0.050 0 12.6445 7
2015-07-13 898496 7 83.34 583.38 0.030 0 17.5014 7
2015-07-20 898496 17 28.10 477.70 0.050 0 23.8850 7
2015-07-27 898496 8 18.34 146.72 0.030 0 4.4016 7
2015-08-03 0 0 0.00 0.00 0.000 0 0.0000 0
2015-08-10 2695488 36 112.58 1191.71 0.120 0 50.4873 24
2015-08-17 0 0 0.00 0.00 0.000 0 0.0000 0
2015-08-24 2695488 30 170.84 1964.79 0.140 0 89.4372 24
2015-08-31 1796992 11 153.84 923.87 0.080 0 31.3983 16
2015-09-07 0 0 0.00 0.00 0.000 0 0.0000 0
2015-09-14 2695488 47 89.22 1432.40 0.120 0 59.1218 27
2015-09-21 1796992 8 162.20 640.40 0.100 0 32.0200 18
2015-09-28 0 0 0.00 0.00 0.000 0 0.0000 0
2015-10-05 1796992 9 146.78 618.58 0.060 0 18.5574 18
2015-10-12 3593984 42 193.83 2216.34 0.170 0 87.6666 40
2015-10-19 0 0 0.00 0.00 0.000 0 0.0000 0
2015-10-26 2695488 44 243.58 3587.50 0.145 250 422.8221 30

To group by a specific period, use pd.Grouper; for example, annual totals (years ending in December): df.groupby(['customer name', 'sku', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
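
Swapping the frequency gives other windows; for example, monthly totals per customer:

# calendar-month buckets via pd.Grouper
df.groupby(['customer name', pd.Grouper(key='date', freq='M')])['ext price'].sum().head()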

Here is an aggregation function that is sometimes useful.

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
ext price quantity unit price
mean 571.75346 10.411333 55.316493
sum 857630.19000 15617.000000 NaN

You can create custom functions

get_max = lambda x: x.value_counts(dropna=False).index[0]
get_max.__name__ = "most frequent" # required for row label

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
ext price quantity unit price sku
mean 571.75346 10.411333 55.316493 NaN
most frequent NaN NaN NaN TL-23025
sum 857630.19000 15617.000000 NaN NaN

If you want to control the order of the output rows and columns, use an ordered dictionary.

import collections
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
df.agg(f)
ext price quantity sku
mean 571.75346 10.411333 NaN
most frequent NaN NaN TL-23025
sum 857630.19000 15617.000000 NaN
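
On Python 3.7 and later, plain dictionaries preserve insertion order, so the OrderedDict is optional there:

# a plain dict keeps its insertion order on Python 3.7+
f = {'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'sku': [get_max]}
df.agg(f)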

You can of course use any of the available extensions, like qgrid below, for further analysis and filtering.

import qgrid
from IPython.display import display

qgrid_widget = qgrid.show_grid(df, show_toolbar=True)
## Not sure why this is not working - giving it a skip for now
qgrid_widget
QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…
qgrid_widget.get_changed_df()
account number customer name sales rep sku category quantity unit price ext price date commission bonus comp month
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.040 0 67.2524 11
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.050 0 11.7105 2
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.050 0 28.6995 8
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.050 0 54.1740 1
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.050 0 18.1600 8
5 282122 Connelly, Abshire and Von Beth Skiles GJ-90272 Shoes 20 96.62 1932.40 2016-03-17 10:19:05 0.045 250 336.9580 3
6 398620 Brekke Ltd Esequiel Schinner DU-87462 Shirt 10 67.64 676.40 2015-11-25 22:05:36 0.050 0 33.8200 11
7 218667 Jaskolski-O'Hara Trish Deckow DU-87462 Shirt 11 91.86 1010.46 2016-04-24 15:05:58 0.050 0 50.5230 4
8 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 18 98.67 1776.06 2015-08-06 08:09:56 0.050 0 88.8030 8
9 14406 Harber, Lubowitz and Fahey Esequiel Schinner GP-14407 Belt 12 64.48 773.76 2016-01-08 09:52:04 0.040 0 30.9504 1
10 887145 Gislason LLC Loring Predovic NZ-99565 Shirt 20 92.87 1857.40 2016-05-22 06:09:58 0.050 0 92.8700 5
11 93583 Hegmann-Howell Esequiel Schinner HZ-54995 Belt 17 40.56 689.52 2015-07-05 01:05:52 0.040 0 27.5808 7
12 280749 Douglas PLC Teagan O'Keefe TK-29646 Shoes 17 42.61 724.37 2016-05-16 19:11:55 0.030 0 21.7311 5
13 453258 Runolfsson-Bayer Shannen Hudson FI-01804 Shirt 20 12.31 246.20 2015-08-24 21:32:26 0.050 0 12.3100 8
14 453258 Runolfsson-Bayer Shannen Hudson EO-54210 Shirt 20 67.95 1359.00 2015-11-24 15:04:30 0.050 0 67.9500 11
15 280749 Douglas PLC Teagan O'Keefe GJ-90272 Shoes 17 23.20 394.40 2015-09-12 13:16:56 0.030 0 11.8320 9
16 453258 Runolfsson-Bayer Shannen Hudson NZ-99565 Shirt 2 30.23 60.46 2016-03-21 11:31:24 0.050 0 3.0230 3
17 280749 Douglas PLC Teagan O'Keefe GP-14407 Belt 3 22.64 67.92 2015-12-02 01:55:28 0.030 0 2.0376 12
18 398620 Brekke Ltd Esequiel Schinner HZ-54995 Belt 1 36.20 36.20 2015-12-07 18:45:54 0.030 0 1.0860 12
19 251881 Zulauf-Grady Teagan O'Keefe DU-87462 Shirt 9 82.06 738.54 2016-02-23 17:41:41 0.050 0 36.9270 2
20 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe DU-87462 Shirt 5 28.51 142.55 2016-03-01 19:34:33 0.050 0 7.1275 3
21 887145 Gislason LLC Loring Predovic ZY-38455 Shirt 15 95.60 1434.00 2015-10-08 03:01:22 0.050 0 71.7000 10
22 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe TK-29646 Shoes 11 30.80 338.80 2015-11-25 06:07:47 0.030 0 10.1640 11
23 282122 Connelly, Abshire and Von Beth Skiles HZ-54995 Belt 6 64.98 389.88 2016-01-15 21:07:30 0.030 0 11.6964 1
24 758030 Kilback-Abernathy Trish Deckow TK-29646 Shoes 19 26.90 511.10 2015-09-28 14:33:34 0.030 0 15.3330 9
25 280749 Douglas PLC Teagan O'Keefe NZ-99565 Shirt 15 30.34 455.10 2016-03-29 21:19:08 0.050 0 22.7550 3
26 680916 Mueller and Sons Loring Predovic DU-87462 Shirt 2 61.01 122.02 2016-05-19 00:05:12 0.050 0 6.1010 5
27 898496 Weissnat, Veum and Barton Ansley Cummings TL-23025 Shoes 8 18.34 146.72 2015-07-24 17:48:46 0.030 0 4.4016 7
28 530925 Purdy and Sons Teagan O'Keefe DU-87462 Shirt 10 18.40 184.00 2016-03-05 23:45:57 0.050 0 9.2000 3
29 752312 Watsica-Pfannerstill Loring Predovic GP-14407 Belt 15 17.93 268.95 2016-03-18 07:17:35 0.040 0 10.7580 3
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1470 680916 Mueller and Sons Loring Predovic HZ-54995 Belt 18 17.93 322.74 2016-04-28 01:57:00 0.040 0 12.9096 4
1471 218667 Jaskolski-O'Hara Trish Deckow TK-29646 Shoes 13 57.80 751.40 2015-11-09 12:17:52 0.030 0 22.5420 11
1472 887145 Gislason LLC Loring Predovic GP-14407 Belt 5 51.82 259.10 2016-06-08 11:25:05 0.030 0 7.7730 6
1473 93583 Hegmann-Howell Esequiel Schinner ZY-38455 Shirt 10 86.25 862.50 2016-05-10 23:48:57 0.050 0 43.1250 5
1474 530925 Purdy and Sons Teagan O'Keefe FI-01804 Shirt 13 10.13 131.69 2016-04-01 20:39:41 0.050 0 6.5845 4
1475 752312 Watsica-Pfannerstill Loring Predovic HZ-54995 Belt 18 23.48 422.64 2015-07-25 05:51:10 0.040 0 16.9056 7
1476 758030 Kilback-Abernathy Trish Deckow TK-29646 Shoes 7 86.67 606.69 2015-08-19 15:35:31 0.030 0 18.2007 8
1477 398620 Brekke Ltd Esequiel Schinner GP-14407 Belt 7 16.52 115.64 2016-04-07 10:25:42 0.030 0 3.4692 4
1478 218667 Jaskolski-O'Hara Trish Deckow GJ-90272 Shoes 6 36.44 218.64 2015-11-02 20:55:11 0.030 0 6.5592 11
1479 14406 Harber, Lubowitz and Fahey Esequiel Schinner HZ-54995 Belt 5 48.52 242.60 2016-05-22 12:34:35 0.030 0 7.2780 5
1480 218667 Jaskolski-O'Hara Trish Deckow TL-23025 Shoes 14 88.33 1236.62 2015-07-12 15:59:56 0.045 250 305.6479 7
1481 530925 Purdy and Sons Teagan O'Keefe HZ-54995 Belt 9 62.85 565.65 2015-08-12 17:07:20 0.030 0 16.9695 8
1482 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe GP-14407 Belt 9 55.57 500.13 2016-02-27 03:20:13 0.030 0 15.0039 2
1483 758030 Kilback-Abernathy Trish Deckow GJ-90272 Shoes 8 98.87 790.96 2016-02-10 16:51:59 0.030 0 23.7288 2
1484 136521 Labadie and Sons Esequiel Schinner FI-01804 Shirt 15 99.43 1491.45 2015-12-17 21:58:31 0.050 0 74.5725 12
1485 898496 Weissnat, Veum and Barton Ansley Cummings GJ-90272 Shoes 18 11.05 198.90 2016-04-01 10:19:26 0.030 0 5.9670 4
1486 136521 Labadie and Sons Esequiel Schinner HZ-54995 Belt 20 54.67 1093.40 2016-02-20 10:43:41 0.040 0 43.7360 2
1487 218667 Jaskolski-O'Hara Trish Deckow TL-23025 Shoes 19 45.97 873.43 2015-08-12 22:53:38 0.030 0 26.2029 8
1488 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe TL-23025 Shoes 1 55.81 55.81 2016-02-01 17:50:13 0.030 0 1.6743 2
1489 887145 Gislason LLC Loring Predovic ZY-38455 Shirt 10 16.37 163.70 2015-10-11 22:11:45 0.050 0 8.1850 10
1490 898496 Weissnat, Veum and Barton Ansley Cummings GP-14407 Belt 8 83.37 666.96 2016-01-27 03:26:53 0.030 0 20.0088 1
1491 62592 O'Keefe-Koch Shannen Hudson TK-29646 Shoes 18 24.92 448.56 2016-02-11 17:13:23 0.030 0 13.4568 2
1492 530925 Purdy and Sons Teagan O'Keefe HZ-54995 Belt 3 90.33 270.99 2015-09-23 07:36:34 0.030 0 8.1297 9
1493 93583 Hegmann-Howell Esequiel Schinner GP-14407 Belt 5 45.93 229.65 2016-04-13 22:34:56 0.030 0 6.8895 4
1494 14406 Harber, Lubowitz and Fahey Esequiel Schinner ZY-38455 Shirt 16 21.60 345.60 2015-11-18 06:28:56 0.050 0 17.2800 11
1495 136521 Labadie and Sons Esequiel Schinner GP-14407 Belt 4 98.57 394.28 2016-06-01 17:28:44 0.030 0 11.8284 6
1496 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe TK-29646 Shoes 3 65.16 195.48 2016-04-02 16:38:31 0.030 0 5.8644 4
1497 898496 Weissnat, Veum and Barton Ansley Cummings EO-54210 Shirt 17 28.10 477.70 2015-07-20 19:30:10 0.050 0 23.8850 7
1498 62592 O'Keefe-Koch Shannen Hudson FI-01804 Shirt 19 94.96 1804.24 2015-10-05 15:55:01 0.050 0 90.2120 10
1499 530925 Purdy and Sons Teagan O'Keefe DU-87462 Shirt 3 22.86 68.58 2015-10-07 19:49:38 0.050 0 3.4290 10

1500 rows × 13 columns

df.head()
account number customer name sales rep sku category quantity unit price ext price date commission bonus comp month
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.04 0 67.2524 11
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05 0 11.7105 2
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05 0 28.6995 8
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05 0 54.1740 1
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05 0 18.1600 8
df.groupby("category").agg({"quantity":["count","size"],"unit price":["sum"],"ext price":['mean']})
quantity unit price ext price
count size sum mean
category
Belt 298 298 15754.05 548.017852
Shirt 734 734 41696.16 581.485817
Shoes 468 468 25524.53 571.603162

What is nice about the pivot table is that you have three DataFrame parameters to adjust: index, columns, and values, whereas the groupby function only uses index and values.

df.pivot_table(index=["month","sales rep"],columns=["category"], values=["bonus","comp"],aggfunc=[np.sum, np.size],fill_value="No Sale").head(12)
sum size
bonus comp bonus comp
category Belt Shirt Shoes Belt Shirt Shoes Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 0 0 0 84.7025 34.094 38.3076 3 3 2 3 3 2
Beth Skiles 0 0 0 24.4992 35.6795 17.9136 3 2 1 3 2 1
Esequiel Schinner 0 0 750 120.595 348.077 1013.11 3 13 8 3 13 8
Loring Predovic 0 0 250 37.9589 252.68 365.321 6 8 4 6 8 4
Shannen Hudson 0 0 0 116.422 201.026 1.6809 3 6 1 3 6 1
Teagan O'Keefe 0 0 0 42.5652 660.813 28.7259 2 21 3 2 21 3
Trish Deckow 0 0 0 113.095 172.417 20.0688 7 8 3 7 8 3
2 Ansley Cummings No Sale 0 250 No Sale 10.0695 336.991 No Sale 1 4 No Sale 1 4
Beth Skiles No Sale 0 1000 No Sale 17.809 1244.55 No Sale 2 5 No Sale 2 5
Esequiel Schinner 0 0 250 153.831 174.106 413.933 6 8 9 6 8 9
Loring Predovic 0 0 1000 82.4416 366.779 1345 4 11 10 4 11 10
Shannen Hudson 0 0 250 38.8498 152.086 333.645 3 4 3 3 4 3

In the pivot table below, only certain calculations are applied to certain columns; it is more selective.

# this is without brackets, so the type of calculation is not displayed
#df.pivot_table(index=["month","sales rep"],columns=["category"], values=["comp","bonus"],aggfunc={"comp":np.sum, "bonus":np.size},fill_value="No Sale").head(12)
df.pivot_table(index=["month","sales rep"],aggfunc={"comp":[np.sum], "bonus":[np.size]},columns=["category"], values=["comp","bonus"],fill_value="No Sale").head(12)
bonus comp
size sum
category Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 3 3 2 84.7025 34.094 38.3076
Beth Skiles 3 2 1 24.4992 35.6795 17.9136
Esequiel Schinner 3 13 8 120.595 348.077 1013.11
Loring Predovic 6 8 4 37.9589 252.68 365.321
Shannen Hudson 3 6 1 116.422 201.026 1.6809
Teagan O'Keefe 2 21 3 42.5652 660.813 28.7259
Trish Deckow 7 8 3 113.095 172.417 20.0688
2 Ansley Cummings No Sale 1 4 No Sale 10.0695 336.991
Beth Skiles No Sale 2 5 No Sale 17.809 1244.55
Esequiel Schinner 6 8 9 153.831 174.106 413.933
Loring Predovic 4 11 10 82.4416 366.779 1345
Shannen Hudson 3 4 3 38.8498 152.086 333.645
df_pivot = df.pivot_table(index=["month","sales rep"],aggfunc={"comp":[np.sum], "bonus":[np.size]},columns=["category"], values=["comp","bonus"],fill_value="No Sale")

You can now, if you feel the need to, do some querying.

df_pivot.query("month == [1]")
bonus comp
size sum
category Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 3 3 2 84.7025 34.094 38.3076
Beth Skiles 3 2 1 24.4992 35.6795 17.9136
Esequiel Schinner 3 13 8 120.595 348.077 1013.11
Loring Predovic 6 8 4 37.9589 252.68 365.321
Shannen Hudson 3 6 1 116.422 201.026 1.6809
Teagan O'Keefe 2 21 3 42.5652 660.813 28.7259
Trish Deckow 7 8 3 113.095 172.417 20.0688

This is another way to do it; I find it more reliable.

df_pivot[df_pivot.index.get_level_values(0).isin([1])]
bonus comp
size sum
category Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 3 3 2 84.7025 34.094 38.3076
Beth Skiles 3 2 1 24.4992 35.6795 17.9136
Esequiel Schinner 3 13 8 120.595 348.077 1013.11
Loring Predovic 6 8 4 37.9589 252.68 365.321
Shannen Hudson 3 6 1 116.422 201.026 1.6809
Teagan O'Keefe 2 21 3 42.5652 660.813 28.7259
Trish Deckow 7 8 3 113.095 172.417 20.0688
df_pivot[df_pivot.index.get_level_values(1).isin(["Ansley Cummings"])]

bonus comp
size sum
category Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 3 3 2 84.7025 34.094 38.3076
2 Ansley Cummings No Sale 1 4 No Sale 10.0695 336.991
3 Ansley Cummings No Sale 2 2 No Sale 48.57 313.561
4 Ansley Cummings 1 2 2 50.112 62.678 10.4796
5 Ansley Cummings 1 3 No Sale 25.172 76.116 No Sale
6 Ansley Cummings 2 4 2 25.9473 54.596 309.125
7 Ansley Cummings No Sale 2 2 No Sale 36.5295 21.903
8 Ansley Cummings 2 4 2 51.198 90.2805 29.8443
9 Ansley Cummings 2 3 2 39.3054 49.317 21.0768
10 Ansley Cummings 1 4 2 20.1152 154.215 336.159
11 Ansley Cummings 2 2 No Sale 8.0844 72.3245 No Sale
12 Ansley Cummings 2 5 4 73.1176 164.834 50.8647
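
Another option for slicing a MultiIndex is xs, which takes the level name directly:

# all reps for month 1
df_pivot.xs(1, level="month")
# one rep across all months; drop_level=False keeps the month level visible
df_pivot.xs("Ansley Cummings", level="sales rep", drop_level=False)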